Abstract. We present methods for evaluating human and automatic taggers that 
extend current practice in three ways. First, we show how to evaluate taggers that 
assign multiple tags to each test instance, even if they do not assign probabilities. 
Second, we show how to accommodate a common property of manually constructed 
"gold standards" that are typically used for objective evaluation, namely that there 
is often more than one correct answer. Third, we show how to measure performance 
when the set of possible tags is tree-structured in an IS-A hierarchy. To illustrate 
how our methods can be used to measure inter-annotator agreement, we show how 
to compute the kappa coefficient over hierarchical tag sets. 


1. Introduction 

Objective evaluation has been central in advancing our understanding 
of the best ways to engineer natural language processing systems. A 
major challenge of objective evaluation is to design fair and informative 
evaluation metrics, and algorithms to compute those metrics. When the 
task involves any kind of tagging (or "labeling"), the most common 
performance criterion is simply "exact match," i.e. exactly matching the 
right answer scores a point, and no other answer scores any points. This 
measure is sometimes adjusted for the expected frequency of matches 
occurring by chance (Carletta, 1996). Resnik and Yarowsky (1997; to 
appear), henceforth R&Y, have argued that the exact match criterion 
is inadequate for evaluating word sense disambiguation (WSD) systems. 

R&Y proposed a generalization capable of assigning partial credit, 
thus enabling more informative comparisons on a finer scale. In this 
article, we present three further generalizations. First, we show how 
to evaluate non-probabilistic assignments of multiple tags. Second, we 
show how to accommodate a common property of manually constructed 
"gold standards" that are typically used for objective evaluation, namely 
that there is often more than one correct answer. Third, we show how 
to measure performance when the set of possible tags is tree-structured 
in an IS-A hierarchy. To illustrate how our methods can be applied to 
the comparison of human taggers, we show how to compute the kappa 
coefficient (Siegel and Castellan, 1988) over hierarchical tag sets. 






AugustOO.tex; 1/02/2008; 18:34; p.l 






Table I. Hypothetical output of four WSD systems on a test instance, where the 
correct sense is (2). The exact match criterion would assign zero credit to all four 
systems. Source: (Resnik and Yarowsky, 1997) 



                                        WSD System
sense of interest (in English)       1      2      3      4
(1) monetary (e.g. on a loan)      .47    .85    .28   1.00
(2) stake or share (correct)       .42    .05    .24    .00
(3) benefit/advantage/sake         .06    .05    .24    .00
(4) intellectual curiosity         .05    .05    .24    .00


Our methods depend on the tree structure of the tag hierarchy, 
but not on the nature of the nodes in it. For example, although these 
generalizations were motivated by the SENSEVAL exercise (Palmer and 
Kilgarriff, this issue), the mathematics applies just as well to any 
tagging task that might involve hierarchical tag sets, such as part-of-speech 
tagging or semantic tagging (Chinchor, 1998). With respect to word 
sense disambiguation in particular, questions of whether part-of-speech 
and other syntactic distinctions should be part of the sense inventory 
are orthogonal to the issues addressed here. 



2. Previous Work 

Work on tagging tasks such as part-of-speech tagging and word sense 
disambiguation has traditionally been evaluated using the exact match 
criterion, which simply computes the percentage of test instances for 
which exactly the correct answer is obtained. R&Y noted that, even if 
a system fails to uniquely identify the correct tag, it may nonetheless 
be doing a good job of narrowing down the possibilities. To illustrate 
the myopia of the exact match criterion, R&Y used the hypothetical 
example in Table I. Some of the systems in the table are clearly 
better than others, but all would get zero credit under the exact match 
criterion. 

R&Y proposed the following measure, among others, as a more 
discriminating alternative: 

\[ \mathrm{Score}(A) = \mathrm{Pr}_A(c \mid w, \mathrm{context}(w)) \qquad (1) \]

In words, the score for system A on test instance w is the probability 
assigned by the system to the correct sense c given w in its context. In 
the example in Table I, System 1 would get a score of 0.42 and System 4 
would score zero. 
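
As a quick check of Equation 1, the following Python sketch scores the four hypothetical systems of Table I against the correct sense (2). The dictionary layout and the function name `score` are illustrative assumptions, not part of R&Y's proposal.

```python
# Probabilities that each hypothetical system in Table I assigns to senses 1-4.
TABLE_I = {
    "System 1": {1: 0.47, 2: 0.42, 3: 0.06, 4: 0.05},
    "System 2": {1: 0.85, 2: 0.05, 3: 0.05, 4: 0.05},
    "System 3": {1: 0.28, 2: 0.24, 3: 0.24, 4: 0.24},
    "System 4": {1: 1.00, 2: 0.00, 3: 0.00, 4: 0.00},
}


def score(distribution, correct_sense):
    """Equation 1: the probability the system assigns to the correct sense."""
    return distribution.get(correct_sense, 0.0)


for name, dist in TABLE_I.items():
    print(name, score(dist, correct_sense=2))
```

Under this measure the four systems are ranked 1, 3, 2, 4, whereas the exact match criterion would not distinguish them.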



3. New Generalizations 

The generalizations below start with R&Y's premise that, given a 
probability distribution over tags and a single known correct tag, the 
algorithm's score should be the probability that the algorithm assigns 
to the correct tag. 

3.1. Non-probabilistic Algorithms 

Algorithms that output multiple tags but do not assign probabilities 
should be treated as assigning uniform probabilities over the tags that 
they output. For example, an algorithm that considers tags A and B 
as possible, but eliminates tags C, D and E for a word with 5 tags in 
the reference inventory should be viewed as assigning probabilities of .5 
each to A and B, and probability 0 to each of C, D, and E. Under this 
policy, algorithms that deterministically select a single tag are viewed 
as assigning 100% of the probability mass to that one tag, like System 4 
in Table I. These algorithms would get the same score from Equation 1 
as from the exact match criterion. 
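
The uniform-probability policy can be sketched in a few lines of Python. The function name `uniform_distribution` and the representation of a distribution as a dictionary are assumptions made for illustration only.

```python
def uniform_distribution(output_tags, inventory):
    """Spread probability mass evenly over the tags the algorithm output."""
    output = set(output_tags)
    if not output or not output <= set(inventory):
        raise ValueError("output tags must be a non-empty subset of the inventory")
    # Tags that were output share the mass equally; eliminated tags get 0.
    return {tag: (1.0 / len(output) if tag in output else 0.0) for tag in inventory}


inventory = ["A", "B", "C", "D", "E"]
dist = uniform_distribution(["A", "B"], inventory)
print(dist)  # {'A': 0.5, 'B': 0.5, 'C': 0.0, 'D': 0.0, 'E': 0.0}
```

A deterministic tagger is the special case of a single output tag, which receives probability 1.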

3.2. Multiple Correct Tags 

Given multiple correct tags for a given word token, the algorithm's 
score should be the sum of all probabilities that it assigns to any of 
the correct tags; that is, multiple tags are interpreted disjunctively. 
This is consistent with instructions provided to the SENSEVAL 
annotators: "In general, use disjunction ... where you are unsure which tag to 
apply" (Krishnamurthy and Nicholls, 1998). In symbols, we build on 
Equation 1: 

\[ \mathrm{Score}(A) = \sum_{t=1}^{C} \mathrm{Pr}_A(c_t \mid w, \mathrm{context}(w)) \qquad (2) \]

where t ranges over the C correct tags and c_t denotes the t-th correct 
tag. Even if it is impossible to know for certain whether annotators 
intended a multi-tag annotation as disjunctive or conjunctive, the 
disjunctive interpretation gives algorithms the benefit of the doubt. 
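
A minimal Python sketch of the disjunctive score follows. The function name `score` and the second call, in which senses 2 and 3 are both accepted as correct, are hypothetical illustrations; only the System 1 probabilities come from Table I.

```python
def score(distribution, correct_tags):
    """Sum the probabilities that the system assigns to any correct tag."""
    # A set guards against counting a tag twice if the annotation lists it twice.
    return sum(distribution.get(tag, 0.0) for tag in set(correct_tags))


system_1 = {1: 0.47, 2: 0.42, 3: 0.06, 4: 0.05}
print(score(system_1, [2]))     # one correct sense: reduces to Equation 1
print(score(system_1, [2, 3]))  # hypothetical case with two correct senses
```

With a single correct tag the sum has one term, so the disjunctive score coincides with the single-tag score.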



3.3. Tree-structured Tag Sets 

The same scoring criterion can be used for structured tag sets as for 
unstructured ones: What is the probability that the algorithm assigns 
to any of the correct tags? The complication for structured tag sets is 
that it is not obvious how to compare tags that are in a parent-child 
relationship. The probabilistic evaluation of taggers can be extended 
to handle tree-structured tag sets, such as HECTOR (Atkins, 1993), if 
the structure is interpreted as an IS-A hierarchy. For example, if word 
sense A.2 is a sub-sense of word sense A, then any word token of sense 
A.2 also IS-A token of sense A. 

Under this interpretation, the problem can be solved by defining two 
kinds of probability distributions: 

1. Pr(occurrence of parent tag | occurrence of child tag) 

2. Pr(occurrence of child tag | occurrence of parent tag). 

In a tree-structured IS-A hierarchy Pr(parent | child) = 1, so the first 
one is easy. The second one is harder, unfortunately; in general, these 
("downward") probabilities are unknown. Given a sufficiently large 
training corpus, the downward probabilities can be estimated 
empirically. However, in cases of very sparse training data, as in SENSEVAL, 
such estimates are likely to be unreliable, and may undermine the 
validity of experiments based on them. In the absence of reliable prior 
knowledge about tag distributions over various tag-tree branches, we 
appeal to the maximum entropy principle, which dictates that we 
assume a uniform distribution of sub-tags for each tag. This assumption 
is not as bad as it may seem. It will be false in most individual cases, 
but if we compare tagging algorithms by averaging performance over 
many different word types, most of the biases should come out in the 
wash. 

Now, how do we use these conditional probabilities for scoring? The 
key is to treat each non-leaf tag as under-specified. For example, if sense 
A has just the two subsenses A.1 and A.2, then tagging a word with 
sense A is equivalent to giving it a probability of one half of being sense 
A.1 and one half of being sense A.2, given our assumption of uniform 
downward probabilities. This interpretation applies both to the tags in 
the output of tagging algorithms and to the manual (correct, reference) 
annotations. 

4. Example 

Suppose our sense inventory for a given word is as shown in Figure 1. 

Figure 1. Example tag inventory. 

Table II. Examples of the scoring scheme, for the tag inventory in Figure 1. 



Manual Annotation    Algorithm's Output    Score
B                    A                     0
A                    A                     1
A                    A.1                   1
A                    A.1b                  1
A.1                  A                     .5
A.1 and A.2          A                     .5 + .5 = 1
A.1a                 A                     .25
A.1a and B.2         B                     Pr(B.2|B) = 1/3
A.1a and B.2         A.1                   .5
A.1a and B.2         A.1 and B.2           .5 x .5 + .5 x 1 = .75
A.1a and B.2         A.1 and B             .5 x .5 + .5 x .333 = .41666


Under the assumption of uniform downward probabilities, we start by 
deducing that Pr(A.1|A) = .5, Pr(A.1a|A.1) = .5 (so Pr(A.1a|A) = .25), 
Pr(B.2|B) = 1/3, and so on. If any of these conditional probabilities is 
reversed, its value is always 1. For example, Pr(A|A.1a) = 1. Next, 
these probabilities are applied in computing Equation 2, as illustrated 
in Table II. 
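
The scoring scheme can be sketched in Python as follows. The tag inventory in `CHILDREN` is an assumption inferred from the probabilities quoted above (A has two children, A.1 has two, B has three); the names B.1 and B.3, and all function names, are hypothetical. Exact fractions are used so that the results can be compared directly with Table II.

```python
from fractions import Fraction

# Assumed tag inventory, inferred from the downward probabilities in the text.
CHILDREN = {
    "A": ["A.1", "A.2"],
    "A.1": ["A.1a", "A.1b"],
    "B": ["B.1", "B.2", "B.3"],
}


def leaves(tag, mass=Fraction(1)):
    """Distribute `mass` from `tag` down to its leaves, uniformly at each level."""
    kids = CHILDREN.get(tag, [])
    if not kids:
        return {tag: mass}
    dist = {}
    for kid in kids:
        dist.update(leaves(kid, mass / len(kids)))
    return dist


def leaf_distribution(tags):
    """Treat the output tags as equally likely, then push each down to the leaves."""
    dist = {}
    for tag in tags:
        for leaf, p in leaves(tag, Fraction(1, len(tags))).items():
            dist[leaf] = dist.get(leaf, 0) + p
    return dist


def score(manual_tags, output_tags):
    """Probability mass the output places on leaves covered by any correct tag."""
    correct = set()
    for tag in manual_tags:
        correct.update(leaves(tag))
    return sum(p for leaf, p in leaf_distribution(output_tags).items() if leaf in correct)


print(score(["A.1a", "B.2"], ["A.1", "B"]))  # 5/12, i.e. .41666
```

Treating a non-leaf correct tag as the set of leaves beneath it is one way to realize the under-specified interpretation; it gives the same values as the worked rows of the table.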



5. Inter-Annotator Agreement Given Hierarchical Tag Sets 

Gold standard annotations are often validated by measurements of 
inter-annotator agreement. The computation of any statistic that may 
be used for this purpose necessarily involves comparing tags to see 
whether they are the same. Again, the question arises as to how to 
compare tags that are in a parent-child relationship. We propose the 
same answer as before: Treat non-leaf tags as underspecified. 

To compute agreement statistics under this proposal, every non-leaf 
tag in each annotation is recursively distributed over its children, 
using uniform downward probabilities. The resulting annotations involve 
only the most specific possible tags, which can never be in a 
parent-child relationship. Agreement statistics can then be computed as usual, 
taking into account the probabilities distributed to each tag. 

One of the most common measures of pairwise inter-annotator 
agreement is the kappa coefficient (Siegel and Castellan, 1988): 

\[ \kappa = \frac{\mathrm{Pr}(A) - \mathrm{Pr}(E)}{1 - \mathrm{Pr}(E)} \qquad (3) \]

where Pr(A) is the proportion of times that the annotators agree and 
Pr(E) is the probability of agreement by chance. Once the annotations 
are distributed over the leaves L of the tag inventory, these quantities 
are easy to compute. Given a set of test instances T, 

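\[ \mathrm{Pr}(A) = \frac{1}{|T|} \sum_{t \in T} \sum_{l \in L} \mathrm{Pr}(l \mid \mathrm{annotation}_1(t)) \cdot \mathrm{Pr}(l \mid \mathrm{annotation}_2(t)) \qquad (4) \]

\[ \mathrm{Pr}(E) = \sum_{l \in L} \mathrm{Pr}(l)^2 \qquad (5) \]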

Computing these probabilities over just the leaves of the tag inventory 
ensures that the importance of non-leaf tags is not inflated by 
double-counting. 
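
A Python sketch of the kappa computation over leaf distributions is given below. It assumes that each annotation has already been distributed over the leaves and is represented as a dictionary from leaf to probability. The text does not spell out how Pr(l) is estimated; the sketch assumes it is the share of annotation mass on leaf l, pooled over both annotators. The function name and the example data are hypothetical.

```python
def kappa(annotations_1, annotations_2):
    """Kappa from two equal-length lists of leaf distributions (leaf -> prob)."""
    if len(annotations_1) != len(annotations_2) or not annotations_1:
        raise ValueError("need two non-empty annotation lists of equal length")
    n = len(annotations_1)

    # Equation 4: average, over test instances, of the probability that
    # both annotators land on the same leaf.
    pr_a = sum(
        sum(p * d2.get(leaf, 0.0) for leaf, p in d1.items())
        for d1, d2 in zip(annotations_1, annotations_2)
    ) / n

    # Assumption: Pr(l) is the pooled share of annotation mass on leaf l.
    pooled = {}
    for dist in list(annotations_1) + list(annotations_2):
        for leaf, p in dist.items():
            pooled[leaf] = pooled.get(leaf, 0.0) + p / (2 * n)

    # Equation 5: chance agreement is the sum of squared leaf probabilities.
    pr_e = sum(p * p for p in pooled.values())
    if pr_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (pr_a - pr_e) / (1 - pr_e)  # Equation 3


# Hypothetical annotations of three test instances by two annotators.
first = [{"A.1a": 1.0}, {"B.2": 1.0}, {"A.1a": 0.5, "A.1b": 0.5}]
second = [{"A.1a": 1.0}, {"B.1": 1.0}, {"A.1a": 1.0}]
print(kappa(first, second))
```

In the third instance the first annotator's non-leaf tag A.1 has been split evenly over A.1a and A.1b, so it contributes an agreement of .5 with the second annotator's A.1a.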



6. Conclusion 

We have presented three generalizations of standard evaluation 
methods for tagging tasks. Our methods are based on the principle of 
maximum entropy, which minimizes potential evaluation bias. As with the 
R&Y generalization in Equation 1, and the exact match criterion before 
it, our methods produce scores that can be justifiably interpreted as 
probabilities. Therefore, decision processes can combine these scores 
with other probabilities in a maximally informative way by using the 
axioms of probability theory. 

Our generalizations make few assumptions, but even these few 
assumptions lead to some limitations on the applicability of our proposal. 
First, although we are not aware of any algorithms that were designed 
to behave this way, our methods are not applicable to algorithms that 
conjunctively assign more than one tag per test instance. A potentially 
more serious limitation is our interpretation of tree-structured tag sets 
as IS-A hierarchies. There has been considerable debate, for example, 
about whether this interpretation is valid for such well-known tag sets 
as HECTOR and WordNet. 

This work can be extended in a number of ways. For example, it 
would not be difficult to generalize our methods from trees to 
hierarchies with multiple inheritance, such as WordNet (Fellbaum, 1998). 